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CNl ■ Abstract 

A class of improved estimators is proposed for N-point correlation functions of galaxy clustering, 
and for discrete spatial random processes in general. In the limit of weak clustering, the variance of the 
unbiased estimator converges to the continuum value much faster than with any alternative, all terms 
giving rise to a slower convergence exactly cancel. Explicit variance formulae are provided for both Poisson 
and multinomial point processes using techniques for spatial statistics reported by Ripley (1988). The 
formalism naturally includes most previously used statistical tools such as ./V-point correlation functions 
and their Fourier counterparts, moments of counts-in-cells, and moment correlators. For all these, and 
perhaps some other statistics our estimator provides a straightforward means for efficient edge corrections. 
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1. Introduction 

As correlation functions are some of the most useful descriptors of galaxy clustering, their 
accurate estimation is of utmost importance. There is a considerable spread in opinions on what 
defines an optimal estimator. Since the early work of Peebles (1980 and references therein) and 
coworkers (e.g. Peebles Groth 1975j , Fry fc Peebles 1978; ) the simple DD / RR — 1 estimator 



was widely used, where DD symbolically denotes the number of galaxy pairs at a given range of 
separations, and RR denotes the number of random pairs generated over a similar area as the data. 
It has been known for some time that data close to the edges of a survey should have different 
weights, and for angular correlations an improved estimator has been introduced: DD/DR — 1, 
where DR denotes pairs of random and data points. This method computes the contributions 
from galaxies near the edge in a prorated fashion, via the cross correlations ( Hewett 1982| ). Landy 
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and Szalay (1993, hereafter LS) have shown that by using (DD — 2DR + RR)/RR one can carry 
the above argument even further. In the limit of weak clustering the variance is proportional to 
1/n 2 , i.e. Poissonian in the pair counts, whereas for all the other estimators the leading order 
term is 1/n, where n is the number of points in the survey. Feldman, Kaiser, fc Peacock 199"4] 



employed the Fourier equivalent of this expression, and Hamilton 1992 advocated DD .RR/ DR 



which, except for a small bias, behaves like the LS estimator. [Bernstein 1994| has generalized 
the LS estimator by including explicitly the effects from higher order clustering. Besides of these 
direct estimators ensemble estimators were in use as well. [Limber 1954 , Neyman, Scott, fc Shane 
1956| , proth fc Peebles 1977| used ((iVi - {N))(N 2 - (N))) / {N) 2 , which is essentially equivalent 
to the [LS| estimator. They did not emphasize the difference in the variance of equivalent unbiased 
estimators, and to date other forms such as (N1N2) / (N) 2 — 1, and (N1N2) / {N\) (N2) — 1 became 
wide spread, despite the extra variance they contain. In the statistics literature, probably the 
most comprehensive review of the pair estimators is by Ripley 198S| , who discusses several ways of 
performing the edge corrections. 

At the same time, the estimators for the higher order correlation functions are less well 
understood. The first theoretical attempt in understanding variance of higher order correlations 
was |Colombi, Bouchet fe Schaeffer 19~95 , where they calculated the variance of the void probability 



function, Po- Recently Szapudi &i Colombi 1996 identified and calculated the contributions to 
the variance of moments of counts in cells from the finite survey size, shot noise, the geometry 
of the survey, as well as measurement effects arising from the finite number of sampling cells. 
These theoretical computations were applied for realistic survey properties, and the effects of 
sampling were discussed in Colombi, Szapudi, fc Szalay 1997] , while the results were extended 
for the cumulants and for the weakly non-linear regime by [Szapudi, Colombi, &: Bernardeau 



1997, Finally, Matterese, Verde, &; Heavens 1997 discussed the variance of the bispectrum. The 



importance and distinguishing power of the higher order functions depends on how accurately 
they can be measured, thus any gain in lowering the variance is important. 

The main goals of this Letter are (i) to explain the advantageous properties of the LS type 
estimators in simple terms, (ii) generalize them for arbitrary high order, and for any statistics 
depending on iV-tuplets of discrete point processes, (iii) strongly advocate their use instead of 
traditional estimators with larger variance. We show that for this class of estimators the variance 
from shot noise is Poissonian in the number of N-tuplets, i.e. 1/n . §2 discusses the physical 
reasons, why these estimators are superior to earlier ones. In §3 a short but exact derivation of the 
results is provided for Poissonian and multinomial point processes. The latter represents a survey 
with a given number of galaxies in it. The importance of the results, possible generalizations, and 
practical issues are discussed in §4. 



2. Minimal Variance Estimators 



According to the previous summary of estimators for second order processes the 
(DD — 2DR + RR) / RR estimator has superior shot noise behavior compared to the existing 
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alternatives. While the calculations in the next section prove this statement, here a simple 
argument is given. 

The usual £ = DD/RR — 1 estimator can be expressed by the sample average of 5, the 
dimensionless overdensity, as 

£ = <(1 + + 5 2 )) s -l = (5 1 + 5 2 + <W, , (1) 

where (. . .) denotes the sample average. This is an unbiased estimator, since in the ensemble 
average {5) = 0, however, the presence of the linear 5 terms will add to the variance. Using only 
the absolutely necessary term, 6162, will obviously decrease the variance. The LS estimator on the 
other hand can be rewritten as £2 = [P\ — R\){D2 — R 2 )/RR = {Si52) s , i.e. it has the simplest 
possible structure in 5. This estimator and its Fourier counterpart give the smallest variance, 
since any other estimator contains extra terms, therefore the total variance can be expressed, as 
the variance of the LS estimator, plus extra terms, 

Varf = + 5 2 ) 2 + 2b\b2 + 26^) + Var($i$ a ). (2) 

Although here we do not attempt to prove, that the extra term is indeed positive, the next section 
shows mathematically that the LS estimator has minimal shot noise behavior. 
I 1 

Ripley 1988 has discussed extensively the Poisson variance of second order point processes. 
He has shown, that the term proportional to 1/n in the variance of the simple estimator £o> K$ 
in Ripley's notation, is also proportional to u, the perimeter for a two dimensional survey. This 
implies that the effect is due to inadequate edge-corrections, in agreement with Hewett's (1982) 
original suggestion. The subtraction of the appropriate DR terms is equivalent to an optimal 
edge-correction. 

The effect of the unnecessary terms in the estimator on the variance is even more pronounced 
for the higher order functions, since there will be a lot more terms arising through various 
combinatorial expressions. Here we propose, that the obvious generalization for higher order 
correlations is to create the higher order equivalents of the LS estimator, corresponding to 
(5i...5n). In the symbolic notation, this estimator can be written as 

£ N = (£>i - Rx).(D 2 - R 2 )...(D N - R N )/R 1 ..R N , (3) 

with the exact meaning discussed in the next section in rigorous mathematical fashion. This 



estimator differs from the classic estimator proposed by e.g. Peebles fc Groth 1975 for the 3-point 



function (DDD — DDR)/RRR + 2 which contains extra variance for the above explained reasons. 



Note, however, that the counts-in-cells estimators proposed by Peebles 1975 for the three point 



function and by Peebles 198C for the four-point function are the counts-in-cells equivalent of 



the above equation. It is worth to emphasize again, that for both counts in cells and direct 
estimators, the above form eliminates the excess variance. On the other hand this form requires 
some correction to obtain the irreducible, or "connected" correlations. These generalizations will 
be discussed in the last section. 
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It will be shown next in a general fashion that the above form constitutes an optimal 
estimator in the Poisson and multinomial limits. The elegant formalism outlined in [Ripley 1988 
enabled us to perform most of the calculations in a compact and general form. This work is highly 
recommended as a reference for mathematical details omitted here. 



3. Derivation of the Edge Corrected Estimator 

Let D be a catalog of data points to be analyzed, and R randomly generated over the same 
area, with averages A, and p respectively. The role of R is to perform a Monte Carlo integration 
compensating for edge effects, therefore eventually the limit p — > oo will be taken. A on the 
other hand is assumed to be externally estimated with arbitrary precision. Many interesting 
statistics, such as the iV-point correlation functions and their Fourier analogs, can be formulated 
as functions over iV-points from the catalog. The covariance of a pair of such estimators will be 
calculated in the Poisson and multinomial limits. They correspond to the cases, where the number 
of detected objects is varied or fixed a priori. Finally the results are given for the general case, 
where correlations are non-negligible. 

Let us define symbolically an estimator D p R q , with p + q = N for a function $ symmetric in 
its arguments 

D iRV = ^{x 1 ,...,x p> y u ...,y <l \ (4) 

with Xi ^ Xj E D, yi ^ yj £ R. For example for the two point correlation function 
<&(x, y) = [x,y £ D,r < d(x,y) < r + dr], where d(x,y) is the distance between the two points, 
and [condition] equals 1 when condition holds, otherwise. Ensemble averages can be estimated 
via factorial moment measures, v s flDaley fc Vere- Jones 1972 , Ripley 1988| ). In the Poisson limit 



u s = X s p s , where p s is the s dimensional Lebesgue measure. 
The general covariance of a pair of estimators is 

{D^Rfl^Rf) = £ ^ (^yfy ^fjjlSi+jX^+^p^-i, (5) 

with 

S k = J ® a (%l ■■■%ki Vk+x ■ ■ ■ yN)$b(xi ...x k , z k+ i . . . z N )p 2 N-k (6) 

Here a and b denote possibly two different radial bins, or even different statistics. The 
expression describes the z-fold degeneracy in the p\ + P2 data points from D, as well as the j'-fold 
degeneracy in the q\ + q<i random points drawn from R. For each of these configurations the 
geometric phase-space Si+j is different, and the shot noise contribution of each is appropriately 
summed. The dependence of Sk on a, b, and N is not noted for convenience, but they will be 
assumed throughout the paper. An estimator for the generalized A^-point correlation function is 

^ = |E(^)(-) JV - l (f) i (f) 7V ^ w 
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where S = J &pn (without subscript). This definition can be expressed as (D — R) , where 
~ means normalization with X N , p N respectively. In this symbolic iVth power, each factor is 
evaluated at a different point. Simple calculation in the limit of zero correlations yields (wn) = 0. 
For the same reason, the disconnected parts did not have to be subtracted, which is an important 
simplification in the calculation for N > 4. The covariance between two estimators can be 
evaluated as 

<™> - „e GO ft) ft>f j '*) ( N ; y^-'^-r->>->> <s) 

In the interesting limit, where p — > oo only j = survives. Changing the order of summation 
yields 

{w a ,NW biN ) = SiX'Hlf^i, (9) 

i 

with 

/»=i:(3@(-^- po) 

This latter can be identified as the coefficients of J2n( x v) N i therefore fNi = 5m- Since (wjs?) = 0, 
the final result is ^ 

(co)Varu; i v = ( n ) 

Note that this formula represents both variance and covariance depending on whether in the 
definition of Sn the implicit indices a and b are equal or not. 

While in the above Poisson model the total number of galaxies in the survey can vary, it is 
fixed in the multinomial model. This latter case corresponds to surveys, that detect a certain 
number of galaxies, and use that to estimate also the mean density, i.e. estimator becomes 
conditional given the number of galaxies. This introduces some correlations compared to Poisson, 
which can be taken into account for further precision, especially when the number of galaxies 
in the survey is relatively small. The normalization of the proper estimator changes slightly: 
A* — > (n)i/v l , where n is the total number of objects in the survey. This normalization renders the 
estimator unbiased, even when an external estimator for the mean is inavailable. On the other 
hand, the above Poisson estimator has a slight bias, to leading order oc 0(N(N + l)/2n), if an 
internal estimator for the mean is used. For a multinomial process the factorial moment measure 
is (n)NV~ N pn, with v, the volume of the survey, and (n)jv = n(n — 1) . . . (n — N + 1), the iV-th 
falling factorial. The covariance 



After further reduction (which will be presented elsewhere: [Szapudi &: Szalay 1997b| ) the final 
result is ^ 

(co)Var^ = -lH (13) 
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For N = 2 this coincides with LS, taking into account that S 2 = Sq ~ (S2V 2 ) 2 up to the integral 
constraint, which was neglected during the previous calculation. Again, the variance is inversely 
proportional to the number of possible A-tuplets, i.e. Poisson. 



4. Discussions 

The meaning of the proposed estimator can be understood by specifying the function <£, such 
that it is 1 when the A-tuplet satisfies a certain geometry (with a suitable bin width), and 
otherwise. In this case the estimator will yield the total (or disconnected) A-point correlations 
of the fluctuations of the underlying field 5. For A > 4 these contain extra terms compared to 
the connected correlations, but not as many as the full correlations of the field 1 + 5. However, 
this is the class of estimators which can be precisely corrected for edge effects. The connected 
correlations can be found by subtracting all the possible partitions. Although this is not simple 
to take into account in a general fashion it is feasible for any concrete estimator. For example 
for A = 4 the connected part is (<5i<52<53<54) c = (5i ^^s^) — ((^i^) (^3^4) + sym.). The necessary 
subtraction will induce extra variance compared to the original estimator, however, its order will 
still remain X~ N . Partitioning keeps powers of 5 unchanged thus it does not introduce lower 
order terms according to the arguments given in §2; the constant of proportionality might change 
though. In summary, the connected version of our estimator for the A-point correlations will still 
have a Poisson variance in the Poisson limit, i.e. it is fully corrected for edge effects. 

While so far our discussion focused on the A-point correlation functions, the general nature 
of the assumed function $ can incorporate many of the statistics used in the literature, and 
beyond: Fourier correlation functions (A — 1 spectra), moments of counts in cells, and factorial 
moment correlators are obvious candidates. For the sake of illustration, let us briefly mention 
the suitable $ functions, which correspond to the above quantities. For the A — 1 spectra 
<t(xi,...,xjv) ne lXj ' fcj '/N!, where a refers to the sum over all possible permutations of the 
points Xi to render the function symmetric. Possibly, weights can be incorporated as well. Then 



the definition for A = 2 is equivalent to Feldman, Kaiser, & Peacock 1994 up to the shot noise 



power, which is not present in our estimator, since overlapping indices are conveniently excluded 
in the definition. Moments of counts-in-cells can be represented similarly. The A-th factorial 
moment in a cell is estimated by $ = 1 if a certain A-tuplet fit into the cell, otherwise. For 
A = 2 this method will estimate Ripley's Kq function, or the average of the two-point function 
over a cell. Our estimator therefore will give from which the cumulants Qn oc can 

be straightforwardly calculated. Finally, if $ = 1 for a set of A + M points fit into a pair of 
cells separated by distance r, and otherwise, factorial moment correlators are estimated. The 



proposed estimator gives (5^5% ), from which the cumulant correlators ( Szapudi Sz Szalay 1997a ) 



are simply obtained as Qnm oc (^5^5%^ . Generalization for k-th order joint moments is obvious. 

The fact that the estimator is not exactly connected is of little importance for the Appoint 
correlations and their Fourier counterparts. It is unlikely that in the near future these methods 
can be pushed much beyond N = 4, and the calculation of the A-point function from the 
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estimator of Equation [| is quite straightforward. The other two statistics can be measured 
up to higher order: it was demonstrated both in simulations flColombi, Bouchet, fc Hernquist 



1995 , Baugh, Gaztahaga, fc Efstathiou 1995| ) and galaxy catalogs ( Peebles 1980 , [Szapudi, Szalay 



Boschan 1992, Paztahaga 1992 , Bouchet et al. 1993| , Meiksin, Szapudi, & Szalay 1992, Paztanaga 



1994, [Szapudi et al. 1995| , Szapudi, Meiksin, &; Nichol 1996; ), that the moments can be extracted 



up to 10th order. However, the connected moments, that is cumulants and cumulant correlators 



( Szapudi fc Szalay 1997a ), can be calculated simply via generating functions from our estimator. 
Since the correction for connected moments does not introduce lower order terms in the variance, 
the resulting statistics will be better corrected for edge effects, than any previous methods based 



on moments of counts in cells, which cannot be corrected for edge effects (Szapudi & Colombi 



1996 , Colombi, Szapudi, fc Szalay 1997 ) 



So far we proved rigorously that the proposed edge correction is valid for a Poisson 
distribution. The calculation shows how most of the shot noise is eliminated during this process. 
The eliminated discreteness error is also an edge effect, and it can be corrected for. We propose 



the term "edge-discreteness" effect, to further refine the classification of Szapudi Sz Colombi 1996 
Since previous estimators were not corrected for this extra shot noise- variance, we conjecture that 
they fare worse even when correlations are present. Although we did not prove this rigorously (that 



would require the generalization of Bernstein 1994 for higher order correlations), the arguments 



of §2 show that this is the case. The actual formulae for the errors, however, can only be applied 
to estimate the discreteness contribution to the error (usually not dominant at large scales except 
for small fields with deep exposures, as HDF, etc.) since it does not take into account the finite 
volume and edge effects. These are always present in a realistic distribution and depend on the 
integral of the correlation function, and on the two-point and higher order correlation functions 
( Szapudi Colombi 1996| ). Next we outline possible generalizations, which could improve on the 



approximations if necessary by taking into account the correlations when performing the ensemble 
averages. 

As explained above, the explicit variance formulae can be applied reliably when the finite 
volume and edge effects are not dominating ( [Szapudi Colombi 1996 ), i.e. a sparsely sampled 



survey, where the integral of the correlation function over the survey volume is small as well. 
Fortunately, edge effects are mostly eliminated by the proposed estimator, while finite volume 
effects can be estimated by actually evaluating the correlation integral over the survey volume. 
However, a more accurate calculation is possible, using factorial moment measures which include 
all the needed higher order correlations, or possibly truncated at the appropriate order if large 
scales are considered. For the variance of the generalized iV-point correlation function, a model 
for up to 2N statistics is needed. Such a calculation for the 2-point function was performed by 



Bernstein 1994. Note however, that any estimation of the variance on higher order correlation 
functions is approximate, because all the systematics, and inaccuracy of the prior model will be 
geometrically amplified, in addition to the fact that the interpretation of the variance becomes less 



and less straightforward as the error distribution deviates from the Gaussian ( [Szapudi &: Colombi 



1996 ). Nevertheless the calculation is possible although tedious and the basic steps are outlined 
below. 
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For a general point process the factorial moment measures can be expressed as vjy = Fn/jn, 
where -F/v's are the reducible iV-point correlation functions. Once this is given, it is only a matter of 
simple but lengthy calculations to express all the ensemble averages needed, using the fact that the 
scaling of the factorial moment measure with A for a general (canonical) point process is identical 
to the Poisson case. The result is quite similar to Equation [12], except the falling factorials and the 
powers of the volume are replaced simply by A~\ and Si will depend additionally on the indices 
i\ and %2- Some simplification can be achieved by expressing Fjv = A iV (l + £ + . . . + £3 + . . . + £at) 
with the irreducible correlation functions, and, because of the assumed symmetry of the functions 
3>, replacing by A^ ^ Cf £j. The s are symmetry factors calculable from generating functions 



( Szapudi Szalay 1993 ) or cluster expansion, and C\ = Cm = l,£i = 1- This makes it possible 
to replace the integrals with irreducible integrals. It would be beyond the scope of this work 
to quote the explicit formula and explore its applicability. Our goal was only to clearly show how 
such calculation can be trivially performed if necessary, the rest is left for subsequent research. 

This Letter proposed a set of edge corrected estimators for the generalized TV-point correlation 
functions. The estimators were defined using the framework provided by recent results in spatial 
statistics, which can handle in a uniform way almost all previously used statistics for characterizing 
higher order clustering. Thus our formalism includes among others the iV-point correlation 
functions, iV-point Fourier transforms (or N — 1-spectra), moments of counts-in-cells, and moment 
correlators. Explicit calculations were performed in the Poisson and multinomial limits to show 
that the variance is approximately Poisson, i.e. in both cases it is inversely proportional to the 
number of possible iV-tuplets. The calculation also pinpointed how previous estimators were not 
even corrected for the "edge-discreteness" effects, and physical arguments were given, that they 
would perform even worse when correlations are present. We proposed ways to estimate the other 
important contributions, the edge and finite volume effects, thus a combination of numerical 
estimates combined with our formulae will yield substantially improved accuracy over previous 
techniques. 
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